Finding the Most Dangerous Road in Maryland

Andrew McNamara

There are many roads here in Maryland witha reputation for being dangerous. For example, here in College Park, Route One has something of a reputation for being dangerous.
For this project I wanted to evaluate which roads in Maryland are actually the most dangerous and see if I could get some insights about why this may be the case. I found a dataset on Maryland's opendata of all crashes in Maryland since 2015. I downloaded it locally since opendata.maryland.gov goes down for maintenence not infrequently, but attatched is a link to where I got the dataset: https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu.

Collection and Curation¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
import folium
from datetime import datetime
crashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')

    
/opt/conda/lib/python3.9/site-packages/statsmodels/compat/pandas.py:65: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index as NumericIndex
/tmp/ipykernel_207/2716668262.py:10: DtypeWarning: Columns (33,45) have mixed types. Specify dtype option on import or set low_memory=False.
  crashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')

The dataset was large to a point where my computer was struggling, so since I'm looking to evaluate roads in Maryland, I took only crashes with latitude and longitude values, since I planned on mapping to see where crashes occur. I also reduced the dataset by only looking at 2022 and Q3 since it was the most recent complete quarter of data available and I want to find out what the current most dangerous road is.

In [2]:
crashes = crashes[pd.notnull(crashes['LATITUDE'])]
crashes = crashes[pd.notnull(crashes['LONGITUDE'])]
crashes = crashes[crashes['YEAR'] == 2022]
crashes = crashes[crashes['QUARTER'] == 'Q3']
crashes['datetime'] = crashes['ACC_DATE'].astype(str)+crashes['ACC_TIME'].astype(str)
crashes['datetime'] = crashes['datetime'].apply(lambda x: datetime.strptime(x,"%Y%m%d%H:%M:%S"))
display(crashes)
YEAR QUARTER LIGHT_DESC LIGHT_CODE COUNTY_DESC COUNTY_NO MUNI_DESC MUNI_CODE JUNCTION_DESC JUNCTION_CODE ... FEET_MILES_FLAG DISTANCE_DIR_FLAG REFERENCE_NO REFERENCE_TYPE_CODE REFERENCE_SUFFIX REFERENCE_ROAD_NAME LATITUDE LONGITUDE LOCATION datetime
726311 2022 Q3 Daylight 1.0 Washington 21.0 NaN 0.0 Not Applicable 0.00 ... F W 714.0 CO NaN SUMMERS LA 39.595377 -77.681919 POINT (-77.6819192 39.5953773) 2022-07-07 11:14:00
726344 2022 Q3 Daylight 1.0 Wicomico 22.0 NaN 0.0 Intersection 2.00 ... F S 23.0 CO NaN NORRIS TWILLEY RD 38.515646 -75.709238 POINT (-75.709238 38.5156459) 2022-07-09 17:09:00
726594 2022 Q3 Daylight 1.0 Wicomico 22.0 NaN 96.0 Intersection 2.00 ... F E 313.0 MD NaN DELMAR RD 38.462580 -75.750820 POINT (-75.75082 38.46258) 2022-08-03 19:50:00
726629 2022 Q3 Daylight 1.0 Worcester 23.0 NaN 0.0 NaN 9.04 ... F N 0.0 UU NaN SPUR TO US 50 38.327654 -75.106503 POINT (-75.106503303235 38.327653546825) 2022-08-11 13:00:00
726712 2022 Q3 Daylight 1.0 Baltimore City 24.0 NaN NaN Intersection 2.00 ... F E NaN NaN NaN DRUID HILL AVE 39.297891 -76.612438 POINT (-76.61243800045 39.297890781133) 2022-09-14 09:07:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
848799 2022 Q3 Daylight 1.0 Anne Arundel 2.0 NaN 0.0 Not Applicable 0.00 ... M E 607.0 CO NaN OLD STAGE RD 39.158164 -76.628893 POINT (-76.62889296443 39.158163996637) 2022-07-16 13:48:00
848800 2022 Q3 Daylight 1.0 Baltimore 3.0 NaN 0.0 NaN 88.00 ... F N 1300.0 CO NaN REGESTER AVE 39.378626 -76.608568 POINT (-76.608567616667 39.378625966667) 2022-08-26 18:13:00
848801 2022 Q3 Daylight 1.0 Baltimore 3.0 NaN 0.0 Not Applicable 0.00 ... F W 4095.0 CO NaN EBENEZER RD 39.387539 -76.418305 POINT (-76.418304879322 39.387539278395) 2022-07-12 16:53:00
848802 2022 Q3 Daylight 1.0 Anne Arundel 2.0 NaN 0.0 Non Intersection 1.00 ... F N 12.0 SR NaN OCEANIC DR 39.019090 -76.403650 POINT (-76.40365 39.01909) 2022-08-11 17:46:00
848803 2022 Q3 Dark Lights On 3.0 Baltimore 3.0 NaN 0.0 Non Intersection 1.00 ... F E 2189.0 CO NaN PARMELEE RD 39.367800 -76.762110 POINT (-76.76211 39.3678) 2022-08-02 01:00:00

26392 rows × 56 columns

Exploration¶

I started out with some basic data exploration. I got the number of crashes for each road documented. I then narrowed down to the 200 most dangerous roads in Maryland as defined by number of crashes. By this metric, Route 1 is the third most dangerous road in Maryland in quarter 3 of 2022, with 517 crashes. This metric is admittedly flawed, since it doesn't take into account road length or capacity or anything of the sort, but for exploration, seeing where the most crashes take place by sheer number is useful.

In [3]:
roadData = crashes['RTE_NO'].value_counts()
roadData = pd.DataFrame({'road':roadData.index,'crashes':roadData.values})
roadData = roadData.head(100)
display(roadData)
display(roadData[roadData['road']==1.0])
road crashes
0 95.0 1129
1 695.0 542
2 1.0 517
3 50.0 449
4 40.0 448
... ... ...
95 114.0 31
96 33.0 31
97 564.0 30
98 117.0 30
99 340.0 30

100 rows × 2 columns

road crashes
2 1.0 517

Hypothesis testing on number of crashes¶

To figure out if the number of crashes on any given road is significantly different from the average number of crashes on the 200 most crashed on roads, I did a Z-test, since the sample size is large. I found that Route One does have significantly more crashes than the other top 100 roads in Maryland, witha p value of .004.

In [4]:
import scipy.stats
roadData['z-score'] = roadData['crashes'].apply(lambda x:(roadData['crashes'].mean()-x)/roadData['crashes'].std()) 
roadData['p-value'] = roadData['z-score'].apply(lambda x: scipy.stats.norm.sf(abs(x))*2)
roadData['Significant'] = roadData['p-value'].apply(lambda x: x<.05)
display(roadData.head(10))
road crashes z-score p-value Significant
0 95.0 1129 -6.766122 1.322797e-11 True
1 695.0 542 -2.839874 4.513129e-03 True
2 1.0 517 -2.672658 7.525298e-03 True
3 50.0 449 -2.217828 2.656654e-02 True
4 40.0 448 -2.211140 2.702616e-02 True
5 70.0 367 -1.669358 9.504652e-02 False
6 301.0 326 -1.395122 1.629790e-01 False
7 270.0 323 -1.375056 1.691140e-01 False
8 5.0 312 -1.301481 1.930939e-01 False
9 495.0 310 -1.288104 1.977099e-01 False

Now that I have the 100 most dangerous roads in Marlyand by sheer number of crashes, I want to go back to the original dataset to figure out if I can learn about why these roads are so dangerous. To do this, I took my dataset and removed all entries that didn't take place on my list of the top 100 most dangerous roads so that I could look at their crashes in the context of the full dataset.

In [5]:
roads = roadData['road'].tolist()
crashes = crashes[crashes['RTE_NO'].isin(roads)]

I mapped all of the crashes from the top ten most crashed on roads on Maryland in order to see if this could provide us any insight about where the most crashes occur. After limiting the crash data to only those belonging to the top ten roads, I mapped outliers in the data in a dark blue and the rest in a lighter blue.

In [6]:
map_osm = folium.Map(location=[39.0458, -76.6413], zoom_start=8)
roads = roadData['road'].head(10)
top10roads = crashes[crashes['RTE_NO'].isin(roads)]

for index,crash in top10roads.iterrows():
    if crash['RTE_NO'] ==95 or crash['RTE_NO'] ==695 or crash['RTE_NO'] ==1 or crash['RTE_NO'] ==50 or crash['RTE_NO'] ==40:
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]],fill=True,opacity=0.8,
                      color='#0E345B').add_to(map_osm)
    else: 
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]],fill=True,opacity=0.8,
                      color='#5D8EC1').add_to(map_osm)
map_osm
Out[6]:
Make this Notebook Trusted to load map: File -> Trust Notebook

We can see some interesting things. For one, Some roads only have crashes intermittently, such as route 40, which only has high crash numbers near population centers, such as Hagerstown. However, most roads have relatively uniform numbers of crashes along their length, such as 95 and 270.
We can see that our outlier roads are those going through Baltimore and DC, such as the Baltimore Beltway, Route one, and 95.
Interestingly, Route 50 is also an outlier, even though for most of it's length it's over the bay or on the peninsula as opposed to near DC or Baltimore like the rest of the outliers. We can see a high density of crashes on the bay bridge, which is decidedly not a population center. It is however something of a choke point, if people want to cross the bay, they must pass over it, meaning that it's level of traffic will be very high, leading to more crashes.

Figuring out crashes per mile¶

Now that we've done this, we should get back to the fact not all of these roads are the same size. If we get the lengths of these roads, we should be able to divide number of crashes by length to get crashes per mile and essentially standardize the crash counts, allowing us to compare these roads on equal footing. In order to do this, I got the length of roads manually and added each in since I couldn't find a good dataset of road lengths in Maryland that could be easily joined on.

In [7]:
roadData['length'] = 0
roadData.index = roadData['road']
In [8]:
roadData.at[95,'length'] = 109.01 
roadData.at[695,'length'] = 51.46
roadData.at[1,'length'] = 80.25
roadData.at[50,'length'] =150.06
roadData.at[40,'length'] =221.31
roadData.at[70,'length'] =93.62
roadData.at[301,'length'] =122.85
roadData.at[270,'length'] =34.70
roadData.at[5,'length'] =74.34
roadData.at[495,'length'] =23.02
roadData.at[97,'length'] =55.27
roadData.at[2,'length'] =79.24
roadData.at[140,'length'] =38.50
roadData.at[355,'length'] =36.75
roadData.at[450,'length'] =30.19
roadData.at[193,'length'] =26.07
roadData.at[26,'length'] =44.1
roadData.at[29,'length'] =29.51
roadData.at[13,'length'] =42.48
roadData.at[4,'length'] =64.85
roadData.at[295,'length'] =32.52
roadData.at[650,'length'] =25.89
roadData.at[3,'length'] =9.56
roadData.at[45,'length'] =30.06
roadData.at[214,'length'] =24.97
roadData.at[528,'length'] =9.04
roadData.at[210,'length'] =20.86
roadData.at[202,'length'] =13.92
roadData.at[212,'length'] =10.43
roadData.at[410,'length'] =13.92
roadData.at[32,'length'] =51.79
roadData.at[7,'length'] =22.83
roadData.at[198,'length'] =14.14
roadData.at[100,'length'] =22.05
roadData.at[15,'length'] =37.85
roadData.at[28,'length'] =37.38
roadData.at[595,'length'] =19.97
roadData.at[150,'length'] =13.01
roadData.at[68,'length'] =18.5
roadData.at[83,'length'] =34.50

def crashpermilecalculator(crashes, miles):
    if miles != 0:
        return crashes / miles
    else: 
        return np.nan
    
roadData['crashes per mile'] = roadData.apply(lambda x: crashpermilecalculator(x.crashes, x.length), axis=1)
roadData.sort_values(by='crashes per mile',ascending=False, inplace=True,)

Hypothesis testing on Crashes per mile¶

Now that we have crashes per mile, we can figure out if any of these roads have a statistically significant difference in their crashes per mile. I used a Z test due to the large sample size of 40. I chose my significance level as .05.

In [9]:
cpm_std = roadData.head(40)['crashes per mile'].std()
cpm_mean = roadData.head(40)['crashes per mile'].mean()
roadData.index = range(0,100)
def crashpermileZcalculator(cpm):
    return (cpm-cpm_mean)/cpm_std
roadData['crashes per mile z'] = roadData['crashes per mile'].apply(crashpermileZcalculator)
roadData['crashes per mile p'] = roadData['crashes per mile z'].apply(lambda x: scipy.stats.norm.sf(abs(x))*2)
roadData['cpm Significant'] = roadData['crashes per mile p'].apply(lambda x: x<.05)
print("Mean crashes per mile: ",cpm_mean)
print("Standard deviation crashes per mile: ",cpm_std)
display(roadData.head(10))
Mean crashes per mile:  5.736719044120439
Standard deviation crashes per mile:  3.0091851914620156
road crashes z-score p-value Significant length crashes per mile crashes per mile z crashes per mile p cpm Significant
0 495.0 310 -1.288104 1.977099e-01 False 23.02 13.466551 2.568746 0.010207 True
1 3.0 120 -0.017257 9.862318e-01 False 9.56 12.552301 2.264926 0.023517 True
2 528.0 110 0.049630 9.604173e-01 False 9.04 12.168142 2.137264 0.032577 True
3 695.0 542 -2.839874 4.513129e-03 True 51.46 10.532452 1.593698 0.111004 False
4 95.0 1129 -6.766122 1.322797e-11 True 109.01 10.356848 1.535342 0.124700 False
5 212.0 104 0.089762 9.284764e-01 False 10.43 9.971237 1.407197 0.159369 False
6 270.0 323 -1.375056 1.691140e-01 False 34.70 9.308357 1.186912 0.235262 False
7 202.0 104 0.089762 9.284764e-01 False 13.92 7.471264 0.576417 0.564333 False
8 410.0 103 0.096451 9.231627e-01 False 13.92 7.399425 0.552544 0.580576 False
9 193.0 192 -0.498841 6.178915e-01 False 26.07 7.364787 0.541033 0.588485 False

After doing this, we can see that Route One is only the fourteenth most dangerous road in Maryland. To find out which of these are statistically significant, we calculated z scores and p values for crashes per mile on these roads. We were able to use z scores because we have 40 samples for crashes per mile.
We now know that among the 100 most crashed on roads in Maryland, the average number of crashes per mile is 5.7, with a standard deviation of about 3.
From the p values, we can see that we have three outliers: 495, 3, and 528, which each have over 12 crashes per mile. We can see from earlier that in sheer number of crashes, only 495 even ranked in the top 10, but when we control for length, each has significantly more crashes per mile than the average.

We can now map our top 10 most dangerous roads by crashes per mile, once again using a darker blue for outliers and lighter for the rest.

In [10]:
map_osm = folium.Map(location=[39.0458, -76.6413], zoom_start=8)
roads = roadData['road'].head(10)
top10roadscpm = crashes[crashes['RTE_NO'].isin(roads)]
for index,crash in top10roadscpm.iterrows():
    if crash['RTE_NO'] ==495 or crash['RTE_NO'] ==3 or crash['RTE_NO'] ==528 :
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]],fill=True,opacity=0.8,
                      color='#0E345B').add_to(map_osm)
    else: 
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]],fill=True,opacity=0.8,
                      color='#5D8EC1').add_to(map_osm)
        
map_osm
Out[10]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This yield further insights. We can see that the most dangerous roads tend to be smaller roads neared to population centers. We can see more crashes where roads move near or through major cities like Washington or Baltimore. For the most part, the roads with high absolute numbers of crashes still have high crashes per mile, although the order of the rankings changes.
Interestingly, we can still see that the most dangerous roads are still tend to be nearer population centers. This was also true for the most dangerous roads by sheer number of crashes.
The outliers are interesting.
495 is right outside DC in a very busy area, that combined with it not having any length outside of this populated area to offset the crashes per mile leads to it having a very high number of crashes per mile, which makes sense.
Route 3 has two densely crashed in areas: near Glen Burnie and near Crofton. This is a densely traveled and small road, which leads to the high crashes per mile.
528 being here is strange. I looked a little more into it below.

Why is 528 here?¶

528 is the most interesting road here. It's nowhere near Baltimore or DC, where we've had the most crashes in raw number. Instead, it's near ocean city. Ocean city sees a lot of seasonal travel and I'd exclusively looked at quarter 3 of 2022 to make my data more readable. This made me wonder if I'd accidentally captured a seasonal trend. I extracted the data on 528 in the table. This showed that quarters 2 and 3 (april through september) have far more crashes. This was a seasonal trend that I'd accidentally captured.

In [11]:
rawcrashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')
crashes528 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes528 = crashes528[pd.notnull(crashes528['LONGITUDE'])]
crashes528 = crashes528[crashes528['RTE_NO']==528]
ax = crashes528['QUARTER'].value_counts().plot.bar(color = "#88cdce")
ax.set_title("Seasonal crashes on 528")
/tmp/ipykernel_207/2423536550.py:1: DtypeWarning: Columns (33,45) have mixed types. Specify dtype option on import or set low_memory=False.
  rawcrashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv')
Out[11]:
Text(0.5, 1.0, 'Seasonal crashes on 528')

Is 528 the only seasonally affected outlier?¶

The seasonal variation on 528 raised some questions about the other outliers, so I did the same thing to them to figure out if they have seasonal variation like 528 does.

In [12]:
fig, (ax1,ax2) = plt.subplots(1,2)
crashes495 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes495 = crashes495[pd.notnull(crashes495['LONGITUDE'])]
crashes495 = crashes495[crashes495['RTE_NO']==495]
crashes3 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes3 = crashes3[pd.notnull(crashes3['LONGITUDE'])]
crashes3 = crashes3[crashes3['RTE_NO']==3]
seasonalCrashes495 = crashes495['QUARTER'].value_counts()
seasonalCrashes3 = crashes3['QUARTER'].value_counts()
ax1.bar(seasonalCrashes495.index,seasonalCrashes495.values,color = "#88cdce")
ax2.bar(crashes3['QUARTER'].value_counts().index,crashes3['QUARTER'].value_counts().values,color = "#88cdce")
ax1.set_title("Seasonal crashes on 495")
ax2.set_title("Seasonal crashes on 3")
Out[12]:
Text(0.5, 1.0, 'Seasonal crashes on 3')

We can see the other crashes aren't much affected by season. Route 3 and 495 both have slight variations in crashes per quarter, but nowhere near the amount that 528 did.

Conclusions¶

From this we can draw a few conclusions.
Qualities of dangerous roads: The most dangerous roads in Maryland are the roads closest to population centers. Especially dangerous are small, low capacity roads like Route 3 or 528.
How does Route one fare? Route One, when put into context does not have significantly more crashes than one would expect. The best predictor for which roads will have the most is how close the roads are to populations and for how long they are near those population centers. Near Baltimore and near DC are the most dangerous locations in Maryland generally.
Seasonally affected roads: The number of crashes on 528 is significantly affected by the seasons due to proximity to ocean city.
Most dangerous road: We can now definitively say that in terms of crashes per mile, 495 is the most dangerous road in Maryland.